27 research outputs found

Engineering External Memory LCP Array Construction: Parallel, In-Place and Large Alphabet

Author: Kempa Dominik
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 16th International Symposium on Experimental Algorithms (SEA 2017)
Publication date: 01/01/2017

The suffix array augmented with the LCP array is perhaps the most important data structure in modern string processing. There has been a lot of recent research activity on constructing these arrays in external memory. In this paper, we engineer the two fastest LCP array construction algorithms (ESA 2016) and improve them in three ways. First, we speed up the algorithms by up to a factor of two through parallelism. Just 8 threads are sufficient to make the algorithms essentially I/O bound. Second, we reduce the disk space usage of the algorithms, making them in-place: the input (text and suffix array) is treated as read-only and the working disk space never exceeds the size of the final output (the LCP array). Third, we add support for large alphabets. All previous implementations assume the byte alphabet.

Dagstuhl Research Online Publication Server

Faster External Memory LCP Array Construction

Author: Kempa Dominik
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 24th Annual European Symposium on Algorithms (ESA 2016)
Publication date: 01/01/2016

The suffix array, perhaps the most important data structure in modern string processing, needs to be augmented with the longest-common-prefix (LCP) array in many applications. Their construction is often a major bottleneck, especially when the data is too big for internal memory. We describe two new algorithms for computing the LCP array from the suffix array in external memory. Experiments demonstrate that the new algorithms are about a factor of two faster than the fastest previous algorithm.

Dagstuhl Research Online Publication Server

Lempel-Ziv Parsing in External Memory

Author: Kempa Dominik
Kärkkäinen Juha
Puglisi Simon J.
Publication date: 04/07/2013

For decades, computing the LZ factorization (or LZ77 parsing) of a string has been a requisite and computationally intensive step in many diverse applications, including text indexing and data compression. Many algorithms for LZ77 parsing have been discovered over the years; however, despite the increasing need to apply LZ77 to massive data sets, no algorithm to date scales to inputs that exceed the size of internal memory. In this paper we describe the first algorithm for computing the LZ77 parsing in external memory. Our algorithm is fast in practice and will allow the next generation of text indexes to be realised for massive strings and string collections. Comment: 10 pages
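
As a point of reference for what the LZ77 parsing computes, here is a minimal in-memory sketch in Python. It is a naive quadratic-time greedy factorization, not the external-memory algorithm of the paper. The function names `lz77_factorize` and `lz77_decode` and the phrase encoding (a literal character with length 0, or a source position with a copy length) are assumptions made for illustration.

```python
def lz77_factorize(text):
    """Greedy LZ77 factorization (naive, quadratic time).

    Each phrase is either (char, 0) for a literal character that has not
    occurred before, or (source_position, length) for the longest match
    starting at an earlier position. Matches may overlap the phrase itself.
    """
    phrases = []
    n = len(text)
    i = 0
    while i < n:
        best_len, best_src = 0, -1
        # Try every earlier starting position as a candidate source.
        for j in range(i):
            k = 0
            while i + k < n and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_src = k, j
        if best_len == 0:
            phrases.append((text[i], 0))
            i += 1
        else:
            phrases.append((best_src, best_len))
            i += best_len
    return phrases


def lz77_decode(phrases):
    """Rebuild the text from the phrases produced by lz77_factorize."""
    out = []
    for first, length in phrases:
        if length == 0:
            out.append(first)
        else:
            # Copy one character at a time so self-overlapping copies work.
            for k in range(length):
                out.append(out[first + k])
    return "".join(out)
```

For instance, `lz77_factorize("abababab")` gives two literal phrases followed by a single self-overlapping copy of length 6, and `lz77_decode` restores the original string.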

arXiv.org e-Print Archive

Crossref

Faster Sparse Suffix Sorting

Author: I Tomohiro
Kempa Dominik
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 31st International Symposium on Theoretical Aspects of Computer Science (STACS 2014)
Publication date: 01/01/2014

The sparse suffix sorting problem is to sort $b = o(n)$ arbitrary suffixes of a string of length $n$ using $o(n)$ words of space in addition to the string. We present an $O(n)$ time Monte Carlo algorithm using $O(b \log b)$ space and an $O(n \log b)$ time Las Vegas algorithm using $O(b)$ space. This is a significant improvement over the best prior solutions of [Bille et al., ICALP 2013]: a Monte Carlo algorithm running in $O(n \log b)$ time and $O(b^{1+\epsilon})$ space or $O(n \log^2 b)$ time and $O(b)$ space, and a Las Vegas algorithm running in $O(n \log^2 b + b^2 \log b)$ time and $O(b)$ space. All the above results are obtained with high probability, not just in expectation.

Dagstuhl Research Online Publication Server

LZ-End Parsing in Linear Time

Author: Kempa Dominik
Kosolobov Dmitry
Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik
Publication date: 01/01/2017

Peer reviewed

Dagstuhl Research Online Publication Server

Helsingin yliopiston digitaalinen arkisto

Engineering External Memory LCP Array Construction: Parallel, In-Place and Large Alphabet

Author: Kempa Dominik
Kärkkäinen Juha
Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik
Publication date: 01/01/2017

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Grammar Boosting: A New Technique for Proving Lower Bounds for Computation over Compressed Data

Author: De Rajat
Kempa Dominik
Publication date: 17/07/2023

Grammar compression is a general compression framework in which a string $T$ of length $N$ is represented as a context-free grammar of size $n$ whose language contains only $T$. In this paper, we focus on studying the limitations of algorithms and data structures operating on strings in grammar-compressed form. Previous work focused on proving lower bounds for grammars constructed using algorithms that achieve the approximation ratio $\rho=\mathcal{O}(\text{polylog }N)$. Unfortunately, for the majority of grammar compressors, $\rho$ is either unknown or satisfies $\rho=\omega(\text{polylog }N)$. In their seminal paper, Charikar et al. [IEEE Trans. Inf. Theory 2005] studied seven popular grammar compression algorithms: RePair, Greedy, LongestMatch, Sequential, Bisection, LZ78, and $\alpha$-Balanced. Only one of them ($\alpha$-Balanced) is known to achieve $\rho=\mathcal{O}(\text{polylog }N)$. We develop the first technique for proving lower bounds for data structures and algorithms on grammars that is fully general and does not depend on the approximation ratio $\rho$ of the used grammar compressor. Using this technique, we first prove that $\Omega(\log N/\log \log N)$ time is required for random access on RePair, Greedy, LongestMatch, Sequential, and Bisection, while $\Omega(\log\log N)$ time is required for random access to LZ78. All these lower bounds hold within space $\mathcal{O}(n\text{ polylog }N)$ and match the existing upper bounds. We also generalize this technique to prove several conditional lower bounds for compressed computation. For example, we prove that unless the Combinatorial $k$-Clique Conjecture fails, there is no combinatorial algorithm for CFG parsing on Bisection (for which it holds $\rho=\tilde{\Theta}(N^{1/2})$) that runs in $\mathcal{O}(n^c\cdot N^{3-\epsilon})$ time for all constants $c>0$ and $\epsilon>0$. Previously, this was known only for $c<2\epsilon$.

arXiv.org e-Print Archive

Fast and Space-Efficient Construction of AVL Grammars from the LZ77 Parsing

Author: Kempa Dominik
Langmead Ben
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 29th Annual European Symposium on Algorithms (ESA 2021)
Publication date: 01/01/2021

Grammar compression is, next to Lempel-Ziv (LZ77) and run-length Burrows-Wheeler transform (RLBWT), one of the most flexible approaches to representing and processing highly compressible strings. The main idea is to represent a text as a context-free grammar whose language is precisely the input string. This is called a straight-line grammar (SLG). An AVL grammar, proposed by Rytter [Theor. Comput. Sci., 2003], is a type of SLG that additionally satisfies the AVL property: the heights of parse trees for children of every nonterminal differ by at most one. In contrast to other SLG constructions, AVL grammars can be constructed from the LZ77 parsing in compressed time: O(z log n), where z is the size of the LZ77 parsing and n is the length of the input text. Despite these advantages, AVL grammars are thought to be too large to be practical. We present a new technique for rapidly constructing a small AVL grammar from an LZ77 or LZ77-like parse. Our algorithm produces grammars that are always at least five times smaller than those produced by the original algorithm, and usually not more than double the size of grammars produced by the practical Re-Pair compressor [Larsson and Moffat, Proc. IEEE, 2000]. Our algorithm also achieves low peak RAM usage. By combining this algorithm with recent advances in approximating the LZ77 parsing, we show that our method has the potential to construct a run-length BWT in about one third of the time and peak RAM required by other approaches. Overall, we show that AVL grammars are surprisingly practical, opening the door to much faster construction of key compressed data structures.

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Breaking the $O(n)$-Barrier in the Construction of Compressed Suffix Arrays

Author: Kempa Dominik
Kociumaka Tomasz
Publication date: 23/06/2021

The suffix array, describing the lexicographic order of suffixes of a given text, is the central data structure in string algorithms. The suffix array of a length-$n$ text uses $\Theta(n \log n)$ bits, which is prohibitive in many applications. To address this, Grossi and Vitter [STOC 2000] and, independently, Ferragina and Manzini [FOCS 2000] introduced space-efficient versions of the suffix array, known as the compressed suffix array (CSA) and the FM-index. For a length-$n$ text over an alphabet of size $\sigma$, these data structures use only $O(n \log \sigma)$ bits. Immediately after their discovery, they almost completely replaced plain suffix arrays in practical applications, and a race started to develop efficient construction procedures. Yet, after more than 20 years, even for $\sigma=2$, the fastest algorithm remains stuck at $O(n)$ time [Hon et al., FOCS 2003], which is slower by a $\Theta(\log n)$ factor than the lower bound of $\Omega(n / \log n)$ (following simply from the necessity to read the entire input). We break this long-standing barrier with a new data structure that takes $O(n \log \sigma)$ bits, answers suffix array queries in $O(\log^{\epsilon} n)$ time, and can be constructed in $O(n\log \sigma / \sqrt{\log n})$ time using $O(n\log \sigma)$ bits of space. Our result is based on several new insights into the recently developed notion of string synchronizing sets [STOC 2019]. In particular, compared to their previous applications, we eliminate orthogonal range queries, replacing them with new queries that we dub prefix rank and prefix selection queries. As a further demonstration of our techniques, we present a new pattern-matching index that simultaneously minimizes the construction time and the query time among all known compact indexes (i.e., those using $O(n \log \sigma)$ bits). Comment: 41 pages
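
The gap between the bounds quoted in this abstract can be worked out explicitly. The short calculation below assumes the standard word RAM model with machine words of $\Theta(\log n)$ bits; that model is an assumption here, not something the abstract states.

$$
\begin{aligned}
\text{words needed to read the input:}\quad & \frac{n \log \sigma}{\Theta(\log n)} = \Theta\!\left(\frac{n}{\log n}\right) \quad \text{for } \sigma = 2,\\
\text{previous bound vs. lower bound:}\quad & \frac{n}{n / \log n} = \log n,\\
\text{new bound for } \sigma = 2\text{:}\quad & \frac{n \log \sigma}{\sqrt{\log n}} = \frac{n}{\sqrt{\log n}},\\
\text{new bound vs. lower bound:}\quad & \frac{n / \sqrt{\log n}}{n / \log n} = \sqrt{\log n}.
\end{aligned}
$$

Under this assumption, for a binary text the new construction is a $\sqrt{\log n}$ factor faster than $O(n)$, and a $\sqrt{\log n}$ factor still separates it from the trivial lower bound.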

arXiv.org e-Print Archive

Perushakurakenteiden tehokas muodostus suurille tekstimassoille

Author: Kempa Dominik
Publication venue: 'University of Helsinki Libraries'
Publication date: 02/10/2015

This thesis studies efficient algorithms for constructing the most fundamental data structures used as building blocks in (compressed) full-text indexes. Full-text indexes are data structures that allow efficient searching for occurrences of a query string in a (much larger) text. We are mostly interested in large-scale indexing, that is, dealing with input instances that cannot be processed entirely in internal memory, so that a much slower external memory must be used. Specifically, we focus on three data structures: the suffix array, the LCP array, and the Lempel-Ziv (LZ77) parsing. These are routinely found as components or used as auxiliary data structures in the construction of many modern full-text indexes. The suffix array is a list of all suffixes of a text in lexicographical order. Despite its simplicity, the suffix array is a powerful tool used extensively not only in indexing but also in data compression, string combinatorics, and computational biology. The first contribution of this thesis is an improved algorithm for external memory suffix array construction based on constructing suffix arrays for blocks of text and merging them into the full suffix array. In many applications, the suffix array needs to be augmented with information about the longest common prefix between every two adjacent suffixes in lexicographical order. The array containing such information is called the longest-common-prefix (LCP) array. The second contribution of this thesis is the first algorithm for computing the LCP array in external memory that is not an extension of a suffix-sorting algorithm. When the input text is highly repetitive, general-purpose text indexes are usually outperformed (particularly in space usage) by specialized indexes. One of the most popular families of such indexes is based on the Lempel-Ziv (LZ77) parsing. LZ77 parsing is an encoding of the text that replaces long repeating substrings with references to other occurrences. In addition to indexing, LZ77 is a heavily used tool in data compression. The third contribution of this thesis is a series of new algorithms to compute the LZ77 parsing, both in RAM and in external memory. The algorithms introduced in this thesis significantly improve upon the prior art. For example: (i) our new approach for constructing the LCP array in external memory is faster than the previously best algorithm by a factor of 2-4 and simultaneously reduces the disk space usage by a factor of four; (ii) a parallel version of our improved suffix array construction algorithm is able to handle inputs much larger than considered in the literature so far. In our experiments, computing the suffix array of a 1 TiB file with the new algorithm took a little over a week and required only 7.2 TiB of disk space (including input and output), whereas on the same machine the previously best algorithm would require 3.5 times as much disk space and take about four times longer.
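
To make the definitions of the suffix array and the LCP array in this abstract concrete, here is a small in-memory Python sketch. It is not one of the thesis's external-memory algorithms: the suffix array is built by naive sorting, and the LCP array is computed with the well-known linear-time method of Kasai et al. The function names `suffix_array` and `lcp_array` are chosen for illustration only.

```python
def suffix_array(text):
    """Starting positions of all suffixes in lexicographical order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """LCP array via Kasai et al.: lcp[r] is the length of the longest common
    prefix of the suffixes at sa[r - 1] and sa[r]; lcp[0] is defined as 0."""
    n = len(text)
    # rank[i] is the position of suffix i in the suffix array.
    rank = [0] * n
    for r, s in enumerate(sa):
        rank[s] = r
    lcp = [0] * n
    h = 0  # current match length, reused across iterations
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # lexicographical predecessor of suffix i
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h - 1 characters
        else:
            h = 0
    return lcp
```

For the text `"banana"`, the sorted suffixes are `a`, `ana`, `anana`, `banana`, `na`, `nana`, so the suffix array is `[5, 3, 1, 0, 4, 2]` and the LCP array is `[0, 1, 3, 0, 0, 2]`.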

Helsingin yliopiston digitaalinen arkisto